ABSTRACT

A class of finite-memory deterministic algorithms is introduced and 
investigated. Optimum algorithms are found for a small number of states 
(up to 21) and an asymptotic bound on error probability is obtained for 
a large number of states. The algorithms provide their own stopping
rule.
I. INTRODUCTION



Computers seem to grow more complex and more sophisticated on almost
a continuous basis. A logical future step in computer development would
be to provide the computer with some decision-making capability. This
capability would of necessity be limited by the size of the computer core.
Such a property would be of particular value if the computer were designed
to operate without human assistance. For example, exploration of the
nearer stars could most easily be accomplished by unmanned spacecraft.
Yet the immense distances preclude human involvement in any decision
process. If a computer with a decision-making capability were to be used
within the spacecraft, it would almost certainly be small and of very
limited core size. Such an automaton would be required to make decisions
with minimum probability of error while constrained by available memory.
One form of a decision process could probably be adapted from the
statistical test of hypothesis.

In testing a hypothesis, the statistician normally forms the likelihood
ratio and reduces the ratio to a sufficient statistic which is to be
less than or greater than some constant k. For a sample of size n, $\alpha_n$
(probability of Type I error) and $\beta_n$ (probability of Type II error) will
exponentially approach zero as n becomes large. To apply the procedure
at time n means that sufficient memory must be available at time n to
record the observations $X_1, \ldots, X_n$. In even the simplest cases this
memory must grow indefinitely with time. Summarizing the data in a
sufficient statistic does not ensure that the memory requirement will be
reduced. The sufficient statistic is data reducing only in that it maps
the observations from $R^n$ to $R$, but the cardinality of the memory may be
at least as great. For example, suppose that an experimenter wishes to
estimate $\mu$ for a Normal random variable and uses $\bar{x}$ as a sufficient
statistic. Further suppose that in 99 repeated trials the experimenter finds
that $\sum x = 1.000$. Then $\bar{x} = 1.0/99 = 0.0101\ldots$, and the memory
requirement has become infinite. It would be tempting to round off the
statistic to some finite dimension; however, Cover [1] has shown that
$\alpha_n$ and $\beta_n$ then do not tend to zero. Additionally, the rounded off
statistic may not converge to the same distribution as the estimated
parameter. If constrained by finite memory, some other model must be
devised. One possible approach is to use only the last k observations.
This idea has been investigated by Robbins [2].

Although not originally intended as a method for testing hypotheses,
Robbins' model maximizes the long run expected number of "successes" given
two alternative courses of action and finite memory. Suppose that an
experimenter has two coins and that he wishes to maximize the number of
heads thrown during a sequence of tosses. The minimum variance unbiased
estimator of p is $\bar{x}$ and as the number of repeated trials increases, the
variance of the estimator goes to zero. Therefore if an experimenter
had prior knowledge of the probability of heads for each coin he would
use the coin with the greater probability of heads exclusively and know
with certainty that

$$\lim_{n \to \infty} \frac{\text{number of heads in first } n \text{ tosses}}{n} = \max(p_1, p_2)$$

where $p_i$ = probability of heads for the i-th coin. Without prior
knowledge, and constrained by a finite memory, the experimenter must decide
which coin to use on the basis of the results of the previous r trials.
Robbins formulates a decision rule in the proof of the following theorem:
"Define the rule $R_r$ as follows: start tossing with coin 1. Stop
if the first toss is tail, otherwise continue tossing until the first
run of r successive tails occurs and then stop. This defines the
first block of tosses with coin 1. Now start tossing with coin 2
and apply the same rule, obtaining the first block of tosses with
coin 2. Then start again with coin 1 and apply the same rule, obtaining
the second block of tosses with coin 1, and so on indefinitely,
thus generating an infinite sequence of tosses consisting of alternate
blocks of tosses with coins 1 and 2. With rule $R_r$ so defined,
we assert that

$$\lim_{n \to \infty} \frac{\text{number of heads in first } n \text{ tosses}}{n} = \frac{p_1 q_2^r + p_2 q_1^r}{q_1^r + q_2^r}.\text{"}$$

Note that

$$\lim_{r \to \infty} \frac{p_1 q_2^r + p_2 q_1^r}{q_1^r + q_2^r} = \max(p_1, p_2).$$
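The limit quoted from Robbins can be explored numerically. The following is a minimal Python sketch, not taken from the original: the function names `robbins_limit` and `simulate_robbins` are illustrative, and the simulation assumes $q_i = 1 - p_i$ and independent tosses drawn from a seeded pseudo-random generator.

```python
import random


def robbins_limit(p1: float, p2: float, r: int) -> float:
    """Closed-form long-run fraction of heads under rule R_r (hypothetical helper)."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    return (p1 * q2**r + p2 * q1**r) / (q1**r + q2**r)


def simulate_robbins(p1: float, p2: float, r: int, n_tosses: int, seed: int = 0) -> float:
    """Simulate rule R_r and return the observed fraction of heads."""
    rng = random.Random(seed)
    probs = (p1, p2)
    coin = 0            # index of the coin currently in use (0 means coin 1)
    first = True        # True when the next toss is the first toss of a block
    tail_run = 0        # length of the current run of tails
    heads = 0
    for _ in range(n_tosses):
        is_head = rng.random() < probs[coin]
        if is_head:
            heads += 1
            tail_run = 0
            block_over = False
        else:
            tail_run += 1
            # A block ends on a tail at its first toss, or on r tails in a row.
            block_over = first or tail_run >= r
        first = False
        if block_over:
            coin = 1 - coin
            first = True
            tail_run = 0
    return heads / n_tosses
```

For example, with $p_1 = 0.6$ and $p_2 = 0.4$, `robbins_limit(0.6, 0.4, r)` increases with r toward 0.6, and a long run of `simulate_robbins` should land close to the closed-form value for the same r.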

Using methods similar to the Robbins model, Cover [1] developed a
4-state memory algorithm for testing the hypothesis $p > p_0$ vs $p < p_0$,
given a sequence of iid Bernoulli random variables. In the Cover model,
T and Q each take values in {0, 1}. T keeps track of the currently
favored hypothesis and Q records the results of the current run test.
Two sequences $\{s_i\}_{i=1}^{\infty}$ and $\{r_i\}_{i=1}^{\infty}$ of positive integers are
considered. The sequence of observations is divided into blocks
$S_1, R_1, S_2, R_2, \ldots$, where $S_1$ denotes the first $s_1$ observations,
$R_1$ the next $r_1$, and so on. T is initially arbitrary; if all observations
in a block $S_i$ are equal to 1, Q is set to 1. If all observations in a
block $R_i$ are equal to 0, Q is set to 1. At the end of each block the
currently favored hypothesis is updated by the rule

$$T_n = \begin{cases} 1 & \text{if } Q = 1 \text{ and } n \text{ is at the end of an } S \text{ block} \\ 0 & \text{if } Q = 1 \text{ and } n \text{ is at the end of an } R \text{ block} \\ T_{n-1} & \text{otherwise.} \end{cases}$$

The lengths of the blocks of S's and R's are determined as a function of
$p_0$, and Cover shows that the limiting probability of error under either
hypothesis is zero (see Figure 1).



Figure 1: A two-state Markov chain where T can take on values in {0, 1}.

Although the memory size in Cover's model is now finite, the updating rule
still depends on n.

The first genuine finite memory model has been proposed by Hellman and
Cover [3], [4]. They proposed a family of algorithms of the type

$$T_n = f(T_{n-1}, x_n); \quad f: \{1, 2, \ldots, m\} \times X \to \{1, 2, \ldots, m\}$$

$$d_n = d(T_n); \quad d: \{1, 2, \ldots, m\} \to \{\mathcal{H}, \mathcal{T}\}$$

$T_n$ denotes the statistic at time n, $x_n$ is the value of the nth sample,
f is a transition function and d is a decision function. Note that $T_n$
is of finite memory since $T_n \in \{1, 2, \ldots, m\}$; and given an initial value
of the statistic, the sequence $\{T_n\}$ forms a Markov chain over the state
space $M = \{1, 2, \ldots, m\}$. The goal is to minimize the expected asymptotic
proportion of errors



$$P(e) = E\left\{ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} e_i \right\}$$

where

$$e_i = \begin{cases} 1 & \text{if } d_i \ne H_{\text{true}} \\ 0 & \text{if } d_i = H_{\text{true}}. \end{cases}$$

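Hellman and Cover have established a lower bound for the proportion
of errors. Let $\pi_{\mathcal{H}}$ and $\pi_{\mathcal{T}}$ denote the prior probabilities of the null
and alternate hypotheses. Let $f_{\mathcal{H}}$ and $f_{\mathcal{T}}$ be the probability densities
of the sample under the respective hypotheses with respect to a dominating
measure. Define the likelihood ratio to be $\ell(x) = f_{\mathcal{H}}(x)/f_{\mathcal{T}}(x)$. Let
$\bar{\ell}$ denote the ess sup of the likelihood ratio and $\underline{\ell}$ the ess inf, where
the supremum and infimum are taken over all measurable sets with positive
dominating measure. Define $\gamma = \bar{\ell}/\underline{\ell}$. Then for an irreducible m-state
automaton, $P(e) \ge P^*$ where

$$P^* = \begin{cases} \dfrac{2\left(\pi_{\mathcal{H}} \pi_{\mathcal{T}} \gamma^{m-1}\right)^{1/2} - 1}{\gamma^{m-1} - 1} & \text{if } \gamma^{m-1} \ge \max\left(\dfrac{\pi_{\mathcal{H}}}{\pi_{\mathcal{T}}}, \dfrac{\pi_{\mathcal{T}}}{\pi_{\mathcal{H}}}\right) \\ \min(\pi_{\mathcal{H}}, \pi_{\mathcal{T}}) & \text{otherwise.} \end{cases}$$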






Hellman and Cover further prove that a reducible** (m+1)-state automaton
obeys the same bound on P(e) as an irreducible m-state automaton.

If the null and alternate hypotheses are equally likely a priori, that is,
if $\pi_{\mathcal{H}} = \pi_{\mathcal{T}} = \frac{1}{2}$, then for an irreducible m-state automaton

$$P^* = \frac{\gamma^{(m-1)/2} - 1}{\gamma^{m-1} - 1} = \frac{\gamma^{(m-1)/2} - 1}{\left(\gamma^{(m-1)/2} - 1\right)\left(\gamma^{(m-1)/2} + 1\right)} = \frac{1}{\gamma^{(m-1)/2} + 1}.$$



** We call the automaton reducible (irreducible) if the Markov chain $\{T_n\}$
is reducible (irreducible).
If the m-state automaton is reducible, the bound becomes at least as
great as

$$P(e) \ge \frac{1}{\gamma^{(m-2)/2} + 1}.$$



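In the case of the Bernoulli trials, consider the two hypotheses

$$\mathcal{H}: p = p_{\mathcal{H}}$$

$$\mathcal{T}: p = p_{\mathcal{T}}, \quad \text{where } \pi_{\mathcal{H}} = \pi_{\mathcal{T}} = \tfrac{1}{2}.$$

Without loss of generality it may be assumed that $p_{\mathcal{H}} > p_{\mathcal{T}}$, in which case

$$\bar{\ell} = \frac{p_{\mathcal{H}}}{p_{\mathcal{T}}}; \quad \underline{\ell} = \frac{q_{\mathcal{H}}}{q_{\mathcal{T}}}; \quad \text{and } \gamma = \frac{p_{\mathcal{H}} q_{\mathcal{T}}}{p_{\mathcal{T}} q_{\mathcal{H}}}.$$

Further, if the hypotheses are symmetric, that is, if $p_{\mathcal{H}} = 1 - p_{\mathcal{T}}$, then
$\gamma = (p_{\mathcal{H}}/q_{\mathcal{H}})^2$; for an irreducible m-state automaton

$$P(e) \ge \frac{1}{1 + \left(p_{\mathcal{H}}/q_{\mathcal{H}}\right)^{m-1}}$$

and for a reducible automaton

$$P(e) \ge \frac{1}{1 + \left(p_{\mathcal{H}}/q_{\mathcal{H}}\right)^{m-2}}.$$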
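The equal-prior bounds above are easy to tabulate. The sketch below is an illustration only, written in Python with hypothetical function names; it assumes equal priors, $q = 1 - p$, and the bound expressions $1/(\gamma^{(m-1)/2} + 1)$ and $1/(\gamma^{(m-2)/2} + 1)$ for irreducible and reducible m-state automata respectively.

```python
def gamma_bernoulli(p_h: float, p_t: float) -> float:
    """gamma = (ess sup of likelihood ratio) / (ess inf), Bernoulli case, p_h > p_t."""
    q_h, q_t = 1.0 - p_h, 1.0 - p_t
    return (p_h * q_t) / (p_t * q_h)


def bound_irreducible(gamma: float, m: int) -> float:
    """Equal-prior lower bound on P(e) for an irreducible m-state automaton."""
    return 1.0 / (gamma ** ((m - 1) / 2) + 1.0)


def bound_reducible(gamma: float, m: int) -> float:
    """Equal-prior lower bound on P(e) for a reducible m-state automaton."""
    return 1.0 / (gamma ** ((m - 2) / 2) + 1.0)
```

In the symmetric case $p_{\mathcal{H}} = 0.6$, $p_{\mathcal{T}} = 0.4$, `gamma_bernoulli` gives $(0.6/0.4)^2 = 2.25$, so with m = 5 the irreducible bound is $1/(1 + 1.5^4)$ and the reducible bound is $1/(1 + 1.5^3)$.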



While the lower bound cannot be achieved except in degenerate cases,
Hellman and Cover demonstrate an $\varepsilon$-optimal class of automata, that is,
for every $\varepsilon > 0$ there exists an automaton such that $P(e) \le P^* + \varepsilon$.

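Define:

$$\mathcal{H}_\varepsilon = \{x \in X : \ell(x) \ge [(1/\bar{\ell}) + \varepsilon]^{-1}\}$$

$$\mathcal{T}_\varepsilon = \{x \in X : \ell(x) \le \underline{\ell} + \varepsilon\}$$

$$S_\varepsilon = \{x \in X : x \notin (\mathcal{H}_\varepsilon \cup \mathcal{T}_\varepsilon)\}$$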

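Let the transition function f be specified as follows (see Figure 2):

$$f(i,x) = \begin{cases} i + 1 & x \in \mathcal{H}_\varepsilon \\ i & x \in S_\varepsilon \\ i - 1 & x \in \mathcal{T}_\varepsilon \end{cases} \quad \text{for } 2 \le i \le m-1$$

$$f(1,x) = \begin{cases} 2 & \text{with probability } \delta > 0 \text{ if } x \in \mathcal{H}_\varepsilon \\ 1 & \text{otherwise} \end{cases}$$

$$f(m,x) = \begin{cases} m - 1 & \text{with probability } k\delta > 0 \text{ if } x \in \mathcal{T}_\varepsilon \\ m & \text{otherwise} \end{cases}$$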

Transitions are made to adjacent states only when the events $\mathcal{H}_\varepsilon$ or
$\mathcal{T}_\varepsilon$ are observed. Thus the automaton enters an end state only on
strong evidence to support that hypothesis. If $\delta$ is allowed to become
arbitrarily small, then the automaton tends to leave the end state with a
very low probability. Decisions made in the end states have the least
probability of error, so as $\delta \to 0$, P(e) should asymptotically approach P*.

While the Hellman-Cover algorithm is useful in producing sequences
of decisions, the algorithm is not easily adapted to situations in which
only a single decision is required. The irreducible automaton will
asymptotically approach the lower bound for probability of error after
a "large enough" number of observations; however, there is no easily
defined rule which would specify when this number had been reached. It
should also be noted that the Hellman-Cover automaton requires artificial
randomization for transitions out of the end states. Some ancillary
mechanism must be provided to achieve this desired randomization. In the
case of a small computer, additional core storage would probably be
required. The closer P(e) is to approach P*, the smaller is the
probability $\delta$ which must be generated, which requires even more
additional core storage. It is therefore believed that there are strong
pragmatic reasons for adopting an algorithm with absorbing states despite
higher asymptotic probability of error.

In this paper a special class, $\mathcal{A}_n$, of symmetric (2n + 3)-state
algorithms with two absorbing states will be developed. Derivations and
proofs within the paper are restricted to symmetric Bernoulli random
variables, but it would also be feasible to extend these concepts to
non-symmetric hypotheses and to distributions other than Bernoulli.
II. DESCRIPTION OF THE ALGORITHM



Let $X_1, X_2, X_3, \ldots$ denote a sequence of independent identically
distributed Bernoulli random variables which can take on values H or T.
Consider two hypotheses, $\mathcal{H}$ and $\mathcal{T}$, with equal prior probabilities and
such that

$$P(X_1 = H \mid \mathcal{H}) = P(X_1 = T \mid \mathcal{T}) = p, \quad \text{where } \tfrac{1}{2} < p < 1.$$

As the notation suggests, the sequence of random variables can be thought
of as successive tosses of a coin which is biased towards Heads under
hypothesis $\mathcal{H}$ or biased towards Tails under hypothesis $\mathcal{T}$.

Define the algorithm (M,f,d) (see Figure 3) such that
$M = \{-(n+1), -n, \ldots, -1, 0, 1, \ldots, n, n+1\}$ with $\pm(n+1)$ the two
absorbing states and 0 the initial state; $d(n+1) = \mathcal{H}$,
$d(-(n+1)) = \mathcal{T}$, otherwise arbitrary; and the transition function, f,
such that

$$f(s,H) = s + 1, \quad f(s,T) = s - \rho(s) \quad \text{if } s = 1, 2, \ldots, n$$

$$f(s,T) = s - 1, \quad f(s,H) = s + \rho(|s|) \quad \text{if } s = -1, \ldots, -n$$

$$f(s,H) = 1, \quad f(s,T) = -1 \quad \text{if } s = 0$$

$$f(s,H) = s, \quad f(s,T) = s \quad \text{if } s = \pm(n+1)$$

The integers $\rho(1), \ldots, \rho(n)$ satisfy the inequality $1 \le \rho(s) \le s$.
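The transition rule just defined can be written as a small state machine. The Python sketch below is illustrative rather than the author's implementation: `step` and `run` are hypothetical names, observations are assumed to be the characters "H" and "T", and the tuple `rho` is assumed to hold the setback sizes for states 1 through n in order.

```python
from typing import Iterable, Optional, Sequence, Tuple


def step(s: int, x: str, rho: Sequence[int]) -> int:
    """One transition f(s, x); rho[k - 1] is the setback used in states k and -k."""
    n = len(rho)
    if abs(s) == n + 1:          # absorbing states never move
        return s
    if s == 0:                   # initial state moves one step either way
        return 1 if x == "H" else -1
    if s > 0:                    # a Head advances, a Tail sets back by rho(s)
        return s + 1 if x == "H" else s - rho[s - 1]
    return s - 1 if x == "T" else s + rho[-s - 1]   # mirror image for s < 0


def run(observations: Iterable[str], rho: Sequence[int]) -> Tuple[Optional[str], int]:
    """Start in state 0 and stop at absorption; return (decision, samples used)."""
    n = len(rho)
    assert all(1 <= rho[k - 1] <= k for k in range(1, n + 1)), "need 1 <= rho(s) <= s"
    s, used = 0, 0
    for x in observations:
        s = step(s, x, rho)
        used += 1
        if abs(s) == n + 1:
            # "H" stands for deciding the Heads-biased hypothesis, "T" the other.
            return ("H" if s > 0 else "T"), used
    return None, used            # not yet absorbed: no decision
```

With `rho = (1, 1, 2, 2)`, the sequence "HHHHTHHH" climbs to state 4, is set back to state 2 by the Tail, and is absorbed at state 5 on the eighth observation, so the stopping rule is built into the automaton itself.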



Figure 3. An algorithm $f = (\rho(1), \ldots, \rho(n)) \in \mathcal{A}_n$.
The specific form of the algorithm (M,f,d) will henceforth be denoted
$f = (\rho(1), \ldots, \rho(n))$. Figure 4 shows the algorithm and the transition
matrix for the case when n = 4 and f = (1, 1, 2, 2).



          5   4   3   2   1   0  -1  -2  -3  -4  -5

     5    1   0   0   0   0   0   0   0   0   0   0
     4    p   0   0   q   0   0   0   0   0   0   0
     3    0   p   0   0   q   0   0   0   0   0   0
     2    0   0   p   0   q   0   0   0   0   0   0
     1    0   0   0   p   0   q   0   0   0   0   0
     0    0   0   0   0   p   0   q   0   0   0   0
    -1    0   0   0   0   0   p   0   q   0   0   0
    -2    0   0   0   0   0   0   p   0   q   0   0
    -3    0   0   0   0   0   0   p   0   0   q   0
    -4    0   0   0   0   0   0   0   p   0   0   q
    -5    0   0   0   0   0   0   0   0   0   0   1

Figure 4: The transition matrix for the case where n = 4 and f = (1, 1, 2, 2).
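The matrix of Figure 4 can be generated mechanically from the transition rule. The following Python sketch is an illustration under stated assumptions: `transition_matrix` is a hypothetical helper, rows and columns are ordered from state n+1 down to -(n+1), and a Head is assumed to occur with probability p and a Tail with probability q.

```python
from typing import Any, List, Sequence, Tuple


def transition_matrix(rho: Sequence[int], p: Any = "p", q: Any = "q",
                      zero: Any = "0", one: Any = "1") -> Tuple[List[int], List[List[Any]]]:
    """Return (states, P) with P[i][j] the one-step probability from states[i] to states[j]."""
    n = len(rho)
    states = list(range(n + 1, -(n + 2), -1))      # n+1, n, ..., -(n+1)
    index = {s: k for k, s in enumerate(states)}
    P = [[zero] * len(states) for _ in states]
    for s in states:
        row = P[index[s]]
        if abs(s) == n + 1:                        # absorbing state
            row[index[s]] = one
            continue
        if s == 0:
            up, down = 1, -1
        elif s > 0:
            up, down = s + 1, s - rho[s - 1]
        else:
            up, down = s + rho[-s - 1], s - 1
        row[index[up]] = p                         # destination after a Head
        row[index[down]] = q                       # destination after a Tail
    return states, P
```

Called as `transition_matrix((1, 1, 2, 2))`, the default symbolic entries reproduce the layout of Figure 4; passing numbers for `p`, `q`, `zero` and `one` instead gives a numeric stochastic matrix whose rows sum to one.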
III. THE ε-OPTIMAL STOCHASTIC AUTOMATON



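Although the class $\mathcal{A}_n$ consists of deterministic algorithms only, it
can be easily shown that with randomization, the Hellman-Cover lower
bound could be approached arbitrarily closely. With the algorithm
(M,f,d), the probability of error can be written

$$P_e = \tfrac{1}{2} \Pr(\text{absorption at } -(n+1) \mid \mathcal{H}) + \tfrac{1}{2} \Pr(\text{absorption at } n+1 \mid \mathcal{T}).$$

Let $P_i^j(\mathcal{H})$ denote the probability of absorption in state i without
return to j given that hypothesis $\mathcal{H}$ is true and $P_i^j(\mathcal{T})$ denote the
corresponding probability given $\mathcal{T}$, where $i = \pm(n+1)$. By symmetry the
two terms of $P_e$ are equal, so $P_e = \Pr(\text{absorption at } -(n+1) \mid \mathcal{H})$.
Given $\mathcal{H}$, the probabilities $P^0_{-(n+1)}(\mathcal{H})$ and $P^0_{n+1}(\mathcal{H})$ need not
sum to one, the remainder being the probability of a return to 0. Since
return to 0 is a recurrent event, the chain starts afresh at each return,
so

$$P_e = \frac{P^0_{-(n+1)}(\mathcal{H})}{P^0_{-(n+1)}(\mathcal{H}) + P^0_{n+1}(\mathcal{H})} = \left[ 1 + \frac{P^0_{n+1}(\mathcal{H})}{P^0_{-(n+1)}(\mathcal{H})} \right]^{-1}. \qquad (1)$$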

Let f = (1, 2, 3, ..., n) and $1 > \delta > 0$ and consider the absorption at n+1.
If a Head is observed, move to the next higher state with probability $\delta$,
otherwise remain in that state. If a Tail is observed, return to 0 since
$\rho(s) = s$ for all s. Since return to 0 is a recurrent event,

$$P^0_{n+1}(\mathcal{H}) = p^{n+1}\delta^{n+1} + \binom{n}{1} p^{n+2}\delta^{n+1}(1-\delta) + \binom{n+1}{2} p^{n+3}\delta^{n+1}(1-\delta)^2 + \cdots$$

$$= p^{n+1}\delta^{n+1} \left[ 1 + \binom{n}{1} p(1-\delta) + \binom{n+1}{2} p^2(1-\delta)^2 + \cdots \right]$$

$$= p^{n+1}\delta^{n+1} \sum_{k=0}^{\infty} \binom{n+k-1}{k} p^k (1-\delta)^k$$

$$= p^{n+1}\delta^{n+1} \left[ 1 - p(1-\delta) \right]^{-n}$$

$$= \frac{p^{n+1}\delta^{n+1}}{(q + \delta p)^n}.$$

Similarly,

$$P^0_{-(n+1)}(\mathcal{H}) = \frac{q^{n+1}\delta^{n+1}}{(p + \delta q)^n}.$$

Thus

$$P_e = \left[ 1 + \left(\frac{p}{q}\right)^{n+1} \left( \frac{p + \delta q}{q + \delta p} \right)^{n} \right]^{-1}.$$

Taking the limit as $\delta \to 0$, we have

$$P_e = \left[ 1 + \left(\frac{p}{q}\right)^{2n+1} \right]^{-1}$$

which is the Hellman-Cover lower bound for a reducible automaton with
m = 2n + 3 states. Unfortunately this $\varepsilon$-optimal automaton is of little
practical use since the expected time to absorption becomes infinite.
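The approach to the bound as $\delta$ shrinks can be checked numerically. The Python sketch below is only an illustration with hypothetical function names; it assumes the error expression $P_e = [1 + (p/q)^{n+1}((p + \delta q)/(q + \delta p))^n]^{-1}$ for the randomized automaton and the limiting value $[1 + (p/q)^{2n+1}]^{-1}$.

```python
def pe_randomized(p: float, n: int, delta: float) -> float:
    """Error probability of the randomized automaton with f = (1, 2, ..., n)."""
    q = 1.0 - p
    ratio = (p / q) ** (n + 1) * ((p + delta * q) / (q + delta * p)) ** n
    return 1.0 / (1.0 + ratio)


def pe_limit(p: float, n: int) -> float:
    """Limiting error probability as delta tends to zero."""
    q = 1.0 - p
    return 1.0 / (1.0 + (p / q) ** (2 * n + 1))
```

For p = 0.6 and n = 2, setting `delta = 1.0` recovers the deterministic value $1/(1 + 1.5^3)$, and smaller values of `delta` move `pe_randomized` down toward `pe_limit(0.6, 2)`, which equals $1/(1 + 1.5^5)$. The price, as noted in the text, is an absorption time that grows without bound.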
IV. DETERMINATION OF ERROR PROBABILITY



To obtain an explicit expression for the probability of error in
terms of the algorithm, we will first prove the following proposition.

Proposition 1: Let $f = (\rho(1), \ldots, \rho(n))$. Then

$$P_e = \left[ 1 + \left(\frac{p}{q}\right)^{n+1} R_n(f) \right]^{-1}, \qquad R_n(f) = \frac{F_n(q,f)}{F_n(p,f)}, \qquad (2)$$

where $F_n(x,f)$ is a polynomial in x of degree less than or equal to n and
with integral coefficients. With initial conditions $F_{-1} = 0$ and $F_0 = 1$,
these polynomials satisfy the recurrence relationship

$$F_n(x,f) = F_{n-1}(x,f) - (1-x)\,x^{\rho(n)}\,F_{n-1-\rho(n)}(x,f)$$

where n = 1, 2, ... .
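Proposition 1 translates directly into a short computation. The Python sketch below is an illustration, not the author's code: `F` and `error_probability` are hypothetical names, `rho` is assumed to list the setback sizes for states 1 through n, and exact rational arithmetic is used so that results can be compared without rounding.

```python
from fractions import Fraction
from typing import Sequence


def F(n: int, x: Fraction, rho: Sequence[int]) -> Fraction:
    """F_n(x, f) from the recurrence, with F_{-1} = 0 and F_0 = 1."""
    vals = {-1: Fraction(0), 0: Fraction(1)}
    for k in range(1, n + 1):
        r = rho[k - 1]                       # setback size for state k
        vals[k] = vals[k - 1] - (1 - x) * x**r * vals[k - 1 - r]
    return vals[n]


def error_probability(p: Fraction, rho: Sequence[int]) -> Fraction:
    """P_e from equation (2) for the algorithm described by rho."""
    n = len(rho)
    q = 1 - p
    r_n = F(n, q, rho) / F(n, p, rho)
    return 1 / (1 + (p / q) ** (n + 1) * r_n)
```

As a sanity check, `rho = (1, 1)` is a simple random walk on the states -3 to 3, for which the classical gambler's-ruin result gives $1/(1 + (p/q)^3)$; with p = 3/5 this is 8/35, and `error_probability(Fraction(3, 5), (1, 1))` returns the same value.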

Proof:

From (1) we know that

$$P_e = \left[ 1 + \frac{P^0_{n+1}(\mathcal{H})}{P^0_{-(n+1)}(\mathcal{H})} \right]^{-1}. \qquad (3)$$

Next note that $P^0_{n+1}(\mathcal{H}) = p\,P^1_{n+1}(\mathcal{H})$ and that
$P^1_{n+1}(\mathcal{H}) = p\,V_{1,n}$, where $V_{1,n}$ is the expected number of visits to
n before a visit to n+1 or 0 given that the chain was started in state 1.
From basic properties of Markov chains [5] we know that $V_{1,n}$ is the
(1,n)th entry of the fundamental matrix $M = (I-Q)^{-1}$, where $Q = [p_{ij}]$ is
an n × n matrix with entries $p_{ij} = p$ if j = i+1; $p_{ij} = q$ if $j = i - \rho(i)$;
and $p_{ij} = 0$ otherwise. (See Figure 5.)
          0   p   0   0   0   0
          q   0   p   0   0   0
          0   q   0   p   0   0
    Q =   0   q   0   0   p   0
          0   q   0   0   0   p
          0   0   q   0   0   0

Figure 5: The matrix Q for the case when n = 6 and f = (1, 1, 1, 2, 3, 3).



Applying the formula for the inverse of a matrix, we have

$$V_{1,n} = (-1)^{n+1}\,\left|(I-Q)_{(n,1)}\right|\,|I-Q|^{-1}$$

where $|\cdot|$ is the determinant operation and $|(I-Q)_{(n,1)}|$ is the
determinant of the (1,n) cofactor transposed. By deleting the first column
and nth row of the matrix (I-Q), the submatrix $(I-Q)_{(n,1)}$ is lower
triangular with entries -p on the diagonal. Hence its determinant is equal
to $(-p)^{n-1}$. Substituting,

$$V_{1,n} = (-p)^{n-1}(-1)^{n+1}\,|I-Q|^{-1} = p^{n-1}\,|I-Q|^{-1}.$$

If we denote $|I-Q| = F_n(p,f)$ and repeat the entire argument with p and q
interchanged, then

$$V_{1,n} = p^{n-1}\,[F_n(p,f)]^{-1}$$

$$V_{-1,-n} = q^{n-1}\,[F_n(q,f)]^{-1}.$$

Multiplying and substituting into (1), we have that

$$P_e = \left[ 1 + \left(\frac{p}{q}\right)^{n+1} \frac{F_n(q,f)}{F_n(p,f)} \right]^{-1}$$

which was to be proved.
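The identification of $|I-Q|$ with the polynomial $F_n(p,f)$ can be spot-checked for particular algorithms. The Python sketch below is a numerical illustration only, with hypothetical helper names; it builds $I-Q$ from the entries described in the proof, evaluates the determinant by exact Gaussian elimination, and compares it with the recurrence of Proposition 1 under the initial conditions $F_{-1} = 0$, $F_0 = 1$.

```python
from fractions import Fraction
from typing import List, Sequence


def i_minus_q(rho: Sequence[int], p: Fraction) -> List[List[Fraction]]:
    """I - Q for states 1..n: Q has p at (i, i+1) and q at (i, i - rho(i))."""
    n, q = len(rho), 1 - p
    A = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for i in range(1, n + 1):
        if i + 1 <= n:
            A[i - 1][i] -= p
        j = i - rho[i - 1]
        if j >= 1:                      # a setback to state 0 leaves the block
            A[i - 1][j - 1] -= q
    return A


def det(A: List[List[Fraction]]) -> Fraction:
    """Determinant by Gaussian elimination in exact rational arithmetic."""
    A = [row[:] for row in A]
    n, d = len(A), Fraction(1)
    for c in range(n):
        piv = next((r for r in range(c, n) if A[r][c] != 0), None)
        if piv is None:
            return Fraction(0)
        if piv != c:                    # a row swap flips the sign
            A[c], A[piv] = A[piv], A[c]
            d = -d
        d *= A[c][c]
        for r in range(c + 1, n):
            m = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= m * A[c][k]
    return d


def f_rec(n: int, x: Fraction, rho: Sequence[int]) -> Fraction:
    """F_n(x, f) from the recurrence of Proposition 1."""
    vals = {-1: Fraction(0), 0: Fraction(1)}
    for k in range(1, n + 1):
        r = rho[k - 1]
        vals[k] = vals[k - 1] - (1 - x) * x**r * vals[k - 1 - r]
    return vals[n]
```

For the algorithm of Figure 5, `det(i_minus_q((1, 1, 1, 2, 3, 3), Fraction(3, 5)))` and `f_rec(6, Fraction(3, 5), (1, 1, 1, 2, 3, 3))` can be compared directly; agreement on such cases is a check on the bookkeeping, not a substitute for the proof.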